A Python script to input, read, and analyse staff dose data in a database¶

Just like the first article in the Hacking Medical Physics series, we have prepared two parallel scripts. Below follows the Python version of Hacking Medical Physics. Part 2. Working with databases - a step beyond spreadsheets! The article accompanying this script was published in the Spring 2022 issue of European Medical Physics News.

In this tutorial we will use Python to:

  1. Read monthly staff dose data reports from Excel files supplied by a fictitious dosimetry vendor (a total of 28 files),
  2. Clean the data (commas to periods, data types and formats, empty cells etc.), which is a very common task when writing code,
  3. Create a SQLite database and upload the staff dose data,
  4. Query the database for various statistics on the staff dose data.

This tutorial, in the form of a JupyterLab notebook, is included in the repository. If you want to run the code, or modify it to do something else for you, just copy the blocks of input below and paste them into a Python IDE (e.g., Jupyter, PyCharm, PyDev, Visual Studio Code, Spyder, etc.).

Section 1. Importing libraries.¶

To achieve this task, we must first import some useful libraries. To import and clean our data we will use Pandas and NumPy. As in our previous article in the Hacking Medical Physics series, we use NumPy to manipulate data. Pandas is an amazing library for reading information into dataframes, which look and act somewhat like spreadsheets but offer much more functionality than spreadsheet software (e.g., Excel). We also need the sqlite3 library, which allows us to create a database, store data and write SQL queries. Finally, we import a visualization library, plotly express, which is a nice alternative to the matplotlib library used in the previous article.

The sqlite3 library is included with the Python standard library. If you have not worked with Pandas and plotly before, you will need to install them together with their respective dependencies. Pandas requires NumPy, python-dateutil, and pytz (optional dependencies are numexpr and bottleneck for performance, as well as xlrd, which is not optional for us since we want to read Excel files!), see https://pandas.pydata.org/docs/getting_started/install.html for more information. If you are working in JupyterLab, plotly requires ipywidgets, see https://plotly.com/python/getting-started/ for more information. Your Python IDE will tell you if you are missing a dependency needed to run commands from an imported library.

In [1]:
import pandas as pd
import numpy as np
import sqlite3
import plotly.express as px
import plotly

Section 2. Reading and cleaning the data.¶

In this section we show how to read the individual Excel files containing staff dose data using Pandas. First, download the Excel files from the repository and put them in a suitable directory (e.g., /StaffDose). Then we read each Excel file via the read_excel command. Remember that we imported Pandas as pd for convenience, so function calls will begin with pd.. In the read_excel call we declare that there are no headers (header=None) and skip the first row, which does contain the vendor's headers (skiprows=1), because we want to assign our own name to each column in the dataframe.

In [2]:
data = pd.read_excel("/StaffDose/StaffDoses_1.xls",
                     header=None,
                     names=["Customer_name", "Customer_UID", "Department",
                            "Department_UID", "Name", "Person_UID", "Radiation_type",
                            "Hp10", "Hp007", "User_type", "Dosimeter_type",
                            "Dosimeter_placement", "Dosimeter_UID",
                            "Measurement_period_start", "Measurement_period_end",
                            "Read_date", "Report_date", "Report_UID"],
                     skiprows=1)

Now that we have read the first Excel file it is time to start cleaning the staff dose data. There is a lot to do! Looking at the Excel files we can see that the dose data columns contain numbers, blank cells and letters. Also, as is common, we need to assign a period as our decimal separator instead of the Swedish convention of using a comma. To achieve all these things we define a custom function called dose_data_cleaning. Note how the function is named, with a clear description of what is being done and each word separated by a "_" (this style of writing is called snake case).

To the function - a lot is going on here, let's break it down! In the definition of the function we say that we are working with a Pandas dataframe (the common abbreviation for a dataframe is df) and that each column has a name of type string (str), which we will specify when calling our function. In the function we want to achieve seven tasks:

  1. Replace all commas with periods (df[column_name].str.replace(",", ".")),
  2. Where a staff dose reading column has the value NR (meaning that a dosimeter was Not Returned), we create a new column in the dataframe called Status and add the value NR there (df.loc[df[column_name] == "NR", 'Status'] = 'NR'),
  3. Do the same thing as in 2. with all instances of the column value B (meaning Below Measurement Threshold),
  4. In the dose data column, replace all instances of NR and B with the value type Not a Number (NaN),
  5. Cast all cells in the dataframe column that we are working with to type float (the staff dose data columns are a mess when we first read them from Excel, containing a mix of strings (letters), integers and floating point numbers),
  6. Replace all instances of NaN in our newly created Status column with OK; these are all the cells that initially did not have the value NR or B,
  7. In the final step, return the dataframe after all the above data cleaning steps have been completed.

Note that in several places this function does the equivalent of looping through the entire dataframe column and overwriting the existing data, e.g., df[column_name] = df[column_name].str.replace(",", "."). These vectorized operations are a very efficient way of working with data compared to a traditional loop! One final thing: we imported the NumPy library to handle cells with NaN values (np.nan).

In [3]:
def dose_data_cleaning(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    df[column_name] = df[column_name].str.replace(",", ".")
    df.loc[df[column_name] == "NR", 'Status'] = 'NR'
    df.loc[df[column_name] == "B", 'Status'] = 'B'
    df[column_name] = df[column_name].replace({"B": np.nan, "NR": np.nan})
    df[column_name] = df[column_name].astype(float)
    df['Status'] = df['Status'].replace(np.nan, "OK")

    return df

Using our custom function, we can then execute the task of cleaning our defined staff dose data columns, Hp10 and Hp007, respectively.

In [4]:
data = dose_data_cleaning(df=data, column_name='Hp10')
data = dose_data_cleaning(df=data, column_name='Hp007')

Oh, there was another aspect of data cleaning, beyond the staff dose data columns: converting all columns containing date information into a proper date format! We can use Pandas to do this, first with the to_datetime function and then by cutting off the time (which is not supplied by our fictitious vendor) using the dt.date attribute.

In [5]:
data['Measurement_period_start'] = pd.to_datetime(data['Measurement_period_start']).dt.date
data['Measurement_period_end'] = pd.to_datetime(data['Measurement_period_end']).dt.date
data['Read_date'] = pd.to_datetime(data['Read_date'], errors='coerce').dt.date
data['Report_date'] = pd.to_datetime(data['Report_date']).dt.date

Section 3. Creating the database and uploading our staff dose data.¶

In the Python script, unlike the MATLAB version, we do not read all 28 staff dose data files at once. This means that you have to manually change the name of the input file in the second block of code and repeat all the data cleaning steps up to this point for each file. This is what you would do in a real-world situation where a staff dose data report arrives each month. However, if you want to read all the files in a directory at once, it can be achieved with a tweak of the second block of code using Pandas!
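As a sketch of such a tweak (not part of the published script), the loop over files could look like this. The directory, the file pattern, and the injectable reader argument are assumptions added here; reader defaults to pd.read_excel with the same arguments as the single-file example, but can be swapped out so the function is testable without real Excel files:

```python
# Sketch: read every monthly report in a directory in one go.
# Assumes the reports live in /StaffDose and match "StaffDoses_*.xls";
# the reader argument is a hypothetical extra added for testability.
from pathlib import Path

import pandas as pd

COLUMN_NAMES = ["Customer_name", "Customer_UID", "Department",
                "Department_UID", "Name", "Person_UID", "Radiation_type",
                "Hp10", "Hp007", "User_type", "Dosimeter_type",
                "Dosimeter_placement", "Dosimeter_UID",
                "Measurement_period_start", "Measurement_period_end",
                "Read_date", "Report_date", "Report_UID"]

def read_all_reports(directory, pattern="StaffDoses_*.xls",
                     reader=pd.read_excel):
    # Read each report with the same arguments as in the single-file
    # example, then stack them into one dataframe with a fresh index
    frames = [reader(path, header=None, names=COLUMN_NAMES, skiprows=1)
              for path in sorted(Path(directory).glob(pattern))]
    return pd.concat(frames, ignore_index=True)
```

Called as read_all_reports("/StaffDose"), this would replace the single read_excel call, after which the same cleaning steps apply to the combined dataframe.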

For each Excel file read we need to put the data in the database. But we are getting ahead of ourselves; first we need to create our database with the imported sqlite3 library - conn = sqlite3.connect('staff_dose_db.sqlite'). The next command (c = conn.cursor()) gives us a database cursor, which provides access to the database (note that we assign the cursor to the variable c, which will be used from now on to communicate with the database).

In [6]:
conn = sqlite3.connect('staff_dose_db.sqlite')
c = conn.cursor()

With the database connection in place we can create a database table (staffdose) to hold our staff dose data, if there is not already such a table (CREATE TABLE IF NOT EXISTS staffdose). A handy safeguard when we will be inputting data 28 times! Then we add names and data types (VARCHAR for letters, INTEGER is rather obvious, and DOUBLE for decimal numbers) for all the data that we want to put into the database from our dataframe, which is a clean representation of an Excel file containing staff dose data. Note that we add an id column as an identifier for the data input to the database. This identifier is also assigned as a PRIMARY KEY for our table, so that we have a way to uniquely identify each and every row in the table. The final command in the table creation, UNIQUE(Report_UID, Person_UID, Dosimeter_placement), is important: this is how we prevent duplicate data from entering the database (each entry must be a unique combination of report number, personal ID number and dosimeter placement).

The last line of code, conn.commit(), does exactly what it says - commit the SQL commands made to the database. In general, SQL syntax (the language of SQL) is quite natural and logical.

In [7]:
c.execute('''
            CREATE TABLE IF NOT EXISTS staffdose (
                [id] INTEGER NOT NULL PRIMARY KEY,
                [Customer_name] VARCHAR,
                [Customer_UID] INTEGER,
                [Department] VARCHAR,
                [Department_UID] INTEGER,
                [Name] VARCHAR,
                [Person_UID] INTEGER,
                [Radiation_type] VARCHAR,
                [Hp10] DOUBLE,
                [Hp007] DOUBLE,
                [User_type] VARCHAR,
                [Dosimeter_type] VARCHAR,
                [Dosimeter_placement] VARCHAR,
                [Dosimeter_UID] INTEGER,
                [Measurement_period_start] DOUBLE,
                [Measurement_period_end] DOUBLE,
                [Read_date] DOUBLE,
                [Report_date] DOUBLE,
                [Report_UID] DOUBLE,
                [Status] VARCHAR,
                UNIQUE(Report_UID, Person_UID, Dosimeter_placement)
            )
    ''')
conn.commit()

With the database up and running, a table with suitable columns to represent our staff dose data created, and a clever UNIQUE statement to keep duplicate data out of our table, we are ready to send the information in our dataframe to the database. This is really simple using Pandas with the to_sql command! Note that we append the data from each dataframe (Excel file); we could also choose to replace, which would not make sense here. Also note that we do not create an index (our PRIMARY KEY from the previous block of code takes care of that).

In [8]:
data.to_sql("staffdose", conn, if_exists="append", index=False)
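A note in passing: because of the UNIQUE constraint, appending the same report a second time makes SQLite refuse the insert with an IntegrityError. A minimal sketch of how this plays out, using a throwaway in-memory database and a trimmed-down version of the table (the column subset and values are chosen here just for illustration):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway database for illustration
conn.execute("""CREATE TABLE IF NOT EXISTS staffdose (
                    [id] INTEGER NOT NULL PRIMARY KEY,
                    [Report_UID] DOUBLE,
                    [Person_UID] INTEGER,
                    [Dosimeter_placement] VARCHAR,
                    UNIQUE(Report_UID, Person_UID, Dosimeter_placement)
                )""")

report = pd.DataFrame({"Report_UID": [1.0], "Person_UID": [42],
                       "Dosimeter_placement": ["Chest"]})

report.to_sql("staffdose", conn, if_exists="append", index=False)
try:
    # The same report again violates the UNIQUE constraint
    report.to_sql("staffdose", conn, if_exists="append", index=False)
except sqlite3.IntegrityError:
    print("Report already in database, skipping duplicate upload")
```

In a monthly routine you could wrap the real upload in the same try/except, so that accidentally re-reading a file does not crash the script.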

Section 4. Data analysis.¶

With all the staff dose data collected in the database we are ready for some staff dose data analytics! The first example, a box plot, is what we showed in the Spring 2022 issue of European Medical Physics News.

To get started we will define a function that executes a query against our SQLite database, i.e., asks the database for information. Again, as with the function defined earlier in the script, there is a lot going on, and writing functions instead of hardcoding what you want can seem more convoluted at first (but becomes very practical once you are used to working with functions). Further below in the script we have several examples of hardcoded SQL queries. Okay - we define a function execute_boxplot_query, which takes our database cursor as input together with dosimeter Status, Dosimeter type, and the start and end of the data collection period, respectively. With the function input defined we can write a custom query (select_query) against our database (FROM staffdose):

  1. We choose (SELECT) which database columns will be our output - Hp10, Department, STRFTIME('%Y', Measurement_period_end),
  2. Then we have some constraints on Hp10, dosimeter Status, Dosimeter type, and the measurement period - WHERE Hp10 >= 0 AND Status = ? AND Dosimeter_type = ? AND STRFTIME('%Y', Measurement_period_end) BETWEEN ? AND ?,
  3. Then we execute the SQL query together with the defined function arguments,
  4. And finally we tell the function to return the database cursor.

As previously noted, SQL syntax is quite logical. However, note the somewhat convoluted function STRFTIME! SQLite does not have the clearer and more convenient function YEAR, which is present in other dialects of SQL.

Note how we have written this function, with the arguments bound to the corresponding question marks (?) in the SQL query. This is called parameterization. If you neither parameterize nor hardcode your queries (i.e., if you splice user-supplied input directly into the SQL string), the database will be vulnerable to SQL injection attacks, which have caused serious IT security problems over the years (https://en.wikipedia.org/wiki/SQL_injection). A commentary on the importance of protection against SQL injection, both funny and serious, can be found at https://bobby-tables.com - remember little Bobby Tables!
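To see why parameterization matters, here is a small self-contained sketch (with made-up table contents, separate from the tutorial's database) contrasting an unsafe query, where a malicious string is spliced directly into the SQL, with its parameterized counterpart:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staffdose (Name VARCHAR, Hp10 DOUBLE)")
conn.execute("INSERT INTO staffdose VALUES ('Luna Lovegood', 0.12)")
conn.execute("INSERT INTO staffdose VALUES ('Tom Marvolo Riddle', 0.34)")

user_input = "nobody' OR '1'='1"  # a maliciously crafted "name"

# UNSAFE: string formatting splices the input straight into the SQL,
# so the OR clause sneaks in and the query returns every row
unsafe = conn.execute(
    f"SELECT Name FROM staffdose WHERE Name = '{user_input}'").fetchall()

# SAFE: with a ? placeholder the input is treated as one literal value,
# so the malicious string simply matches no row
safe = conn.execute(
    "SELECT Name FROM staffdose WHERE Name = ?", (user_input,)).fetchall()

print(len(unsafe), len(safe))  # the unsafe query leaks all rows
```

The same placeholder mechanism is what the function below relies on.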

In [8]:
def execute_boxplot_query(cur, Status, Dosimeter_type, Period_start, Period_stop):
    select_query = ('''
    SELECT Hp10, Department, STRFTIME('%Y', Measurement_period_end)
    FROM staffdose
    WHERE Hp10 >= 0 AND Status = ? AND Dosimeter_type = ?
    AND STRFTIME('%Y', Measurement_period_end) BETWEEN ? AND ?
    ''')

    cur.execute(select_query,
                (Status, Dosimeter_type, Period_start, Period_stop))
    return cur

It is time to put execute_boxplot_query to use! We pass our database cursor (c), require that the Status for all dosimeter readings be OK and the Dosimeter type be Badge, and set the measurement period to the years 2020 and 2021. Then we put the results (query_data_EMPNews) from this query into a Pandas dataframe using our cursor and a call to the database to collect all the query results (c.fetchall()). If you want to check out the contents of a dataframe, simply type its name and you will get a description (the output will depend on your Python interpreter - PyCharm has very powerful functionality for dataframes, reminiscent of the graphical representation of Excel spreadsheets). For convenience and for the later graphical visualization of results with plotly, we name the columns in our dataframe Hp(10), Department, and Year.

In [9]:
c = execute_boxplot_query(c, "OK", "Badge", "2020", "2021")
query_data_EMPNews = pd.DataFrame(c.fetchall(),
                           columns=["Hp(10)", "Department", "Year"])

We have finally arrived at the point where we can visualize our results on staff dose measurements, as shown in the Spring 2022 issue of European Medical Physics News. To replicate the box plot (https://en.wikipedia.org/wiki/Box_plot), showing statistics from our fictitious vendor, we use the plotly express library imported at the beginning of the script, which offers easy and short commands to access the impressive plotly library. Note the simplicity of the function call to create our box plot (input our dataframe, query_data_EMPNews, designate x- and y-axes, respectively, and color results based on the year of measurement)! To save the output as an HTML file we need to call a function from the plotly library (plotly.offline.plot(fig_EMPNews, filename = 'Yearly_stats.html', auto_open=False)). However, we can also download a png version of the plot from the inline graphics below. Note that we can interact with the plotly graphics below; mouse over various parts of the figure to find out more!

In [10]:
fig_EMPNews = px.box(query_data_EMPNews, x="Department", y="Hp(10)",
                     color='Year', title='A box plot showing statistics for staff dose readings collected during 2020 and 2021',
                     width=1200, height=800)
fig_EMPNews.update_traces(quartilemethod="linear")
fig_EMPNews.show()
plotly.offline.plot(fig_EMPNews, filename = 'Yearly_stats.html', auto_open=False)
Out[10]:
'Yearly_stats.html'

With the box plot shown in the EMP News article out of the way, we can take a closer look at SQL syntax and what there is to be found in our database. For instance, a simple query will tell us the earliest and latest staff dose readings on record. We SELECT the minimum and maximum of the start and end, respectively, of recorded measurement periods from our staffdose table and exclude events when dosimeters were not returned. Here we do not bother with putting the results into a dataframe; we simply print the results, again using our cursor (c) and the fetchall() command in a loop over the rows of data resulting from the query. As the query results show, we have only partial coverage of 2019 and 2022 (which is why we only show yearly statistics for 2020 and 2021 in the box plot above).

In [11]:
c.execute('''
    SELECT MIN(Measurement_period_start), MAX(Measurement_period_end)
    FROM staffdose
    WHERE Status != 'NR'
    ''')

print("Earliest and latest records of staff dose readings, respectively:")
for row in c.fetchall():
    print(row)
Earliest and latest records of staff dose readings, respectively:
('2019-11-22', '2022-01-25')

Another interesting bit of information would be how many staff dose readings we have on record in the database. Here we can query the database for a COUNT of staff dose readings of Hp(10) and Hp(0.07), respectively, together with the Dosimeter_type, which we also GROUP our results BY.

In [12]:
c.execute('''
    SELECT COUNT(Hp10), COUNT(Hp007), Dosimeter_type
    FROM staffdose
    WHERE Status != 'NR'
    GROUP BY Dosimeter_type
    ''')

print("Number of staff dose readings on record, Hp(10) and Hp(0.07), respectively, "
      "per dosimeter type:")
for row in c.fetchall():
    print(row)
Number of staff dose readings on record, Hp(10) and Hp(0.07), respectively, per dosimeter type:
(403, 403, 'Badge')
(0, 47, 'Ring')

It is said that every story needs a great villain, here we will use an SQL query to find him (Spoiler: Luna just has more important things to do some of the time)! Basically, we build on the structure of our last query but we ask for Name and a COUNT of all instances when a dosimeter was not returned (Status = "NR"). We GROUP our results BY the Name of those responsible, and ORDER them BY the number of times dosimeters were not returned from high to low (DESC).

In [13]:
c.execute('''
    SELECT Name, COUNT(Status = 'NR')
    FROM staffdose
    WHERE Status = 'NR'
    GROUP BY Name
    ORDER BY COUNT(Status = 'NR') DESC
    ''')

print("Sum of all instances on record where dosimeters were not returned "
      "(and those responsible):")
for row in c.fetchall():
    print(row)
Sum of all instances on record where dosimeters were not returned (and those responsible):
('Tom Marvolo Riddle', 28)
('Luna Lovegood', 14)

A useful query to the database would be to show accumulated staff dose readings for a year, with names of staff and results divided into different departments (as our database has such information!).

As with the box plot, we create a parameterized function query to the database (execute_yearly_sum_query). Compared with all the previous examples there is only one new thing here: via SQL syntax we ask the database to supply the yearly sum of staff dose for each person, rounded to two decimal places (ROUND(SUM(Hp10), 2)). Note that since we are using Pandas as a middle layer between the database and visualization with plotly (holding our SQL query results and serving as input to plot functions), we can also choose to perform data manipulation directly in our dataframe. In the following blocks of code, we: 1. use our custom function, and 2. make a bar plot from the query results using plotly express. We already did both of these things earlier, only with other types of data, so no further explanation is needed! Remember that the inline plots created with plotly are interactive, check it out!

In [14]:
def execute_yearly_sum_query(cur, year, Status, Dosimeter_type):
    select_query = ('''
    SELECT Name, ROUND(SUM(Hp10), 2), Department
    FROM staffdose
    WHERE Hp10 >= 0 AND STRFTIME('%Y', Measurement_period_end) = ? AND Status = ?
    AND Dosimeter_type = ?
    GROUP BY Name
    ORDER BY SUM(Hp10) ASC
    ''')

    cur.execute(select_query, (year, Status, Dosimeter_type))
    return cur
In [15]:
c = execute_yearly_sum_query(c, "2020", "OK", "Badge")
query_data = pd.DataFrame(c.fetchall(),
                           columns=["Name", "Sum Hp(10)", "Department"])
In [16]:
fig1 = px.bar(query_data, x='Name', y='Sum Hp(10)',
              color='Department',
              title="Total staff dose readings in 2020, badge Hp(10), "
                    "per department",
              width=1200, height=800)
fig1.show()
plotly.offline.plot(fig1,
                    filename = 'Personal_year_sum_parameterize.html',
                    auto_open=False)
Out[16]:
'Personal_year_sum_parameterize.html'

One final example, where we ask the database for the Status of the dosimeter readings (Valid reading, Below threshold, or Not returned) during 2020 and visualize the results using plotly. While this query will not give numerical results to identify those who may need additional radiation safety training (or protective gadgets) - we already did that in the previous example - it will give a nice macroscopic overview on a monthly basis. If you happen not to live and work in the Harry Potter universe, with only 19 employees, consider using the y-axis to identify staff instead (this also applies to the previous plot).

Okay, is there anything new going on here? Actually, very little - you can do a lot with the simple (and, we hope, adequately explained) examples above! Since we are using SQLite we need the STRFTIME function to extract the month of each staff dose reading ('%m'), and we do some tidying of the results for better clarity in the plotly express scatter plot legend. Again, Pandas is pretty amazing: with a few short lines of code we go through the Status column in the dataframe containing our query results and replace the cell values to suit our purposes (e.g., query_data2['Status'] = query_data2['Status'].replace("OK", "Valid reading")). In creating the scatter plot we make some custom choices with colors (color_discrete_sequence=["#00CC96", "#636EFA", "#EF553B"]), fiddle with the y-axis to make a good representation of months (fig2.update_yaxes(nticks=12, tick0=0, dtick=1, range=[-1, 12])), and make the plot markers more pleasing to the eye, or at least we hope so, by adjusting their size as well as the border line width and color (fig2.update_traces(marker=dict(size=10, line=dict(width=2, color='DarkSlateGrey')), selector=dict(mode='markers'))). One more thing: since we color our results based on Status, we also need to order the y-axis so that months are shown in chronological order (fig2.update_layout(yaxis={'categoryorder': 'category ascending'})).

In [17]:
c.execute('''
    SELECT STRFTIME('%m', Measurement_period_end), Department, Status, Name
    FROM staffdose
    WHERE Dosimeter_type = 'Badge' AND STRFTIME('%Y', Measurement_period_end) = '2020'
    ORDER BY Name
    ''')

query_data2 = pd.DataFrame(c.fetchall(), columns=["Month", "Department", "Status", "Name"])
In [18]:
query_data2['Status'] = query_data2['Status'].replace("OK", "Valid reading")
query_data2['Status'] = query_data2['Status'].replace("B", "Below threshold")
query_data2['Status'] = query_data2['Status'].replace("NR", "Not returned")
In [19]:
fig2 = px.scatter(query_data2, x="Name", y="Month", color="Status", width=1200, height=800,
                  color_discrete_sequence=["#00CC96", "#636EFA", "#EF553B"],
                  title="Monthly report of badge dosimeter use in 2020")

fig2.update_yaxes(nticks=12, tick0=0, dtick=1, range=[-1, 12])
fig2.update_layout(yaxis={'categoryorder': 'category ascending'})
fig2.update_traces(marker=dict(size=10, line=dict(width=2, color='DarkSlateGrey')),
                   selector=dict(mode='markers'))

fig2.show()
plotly.offline.plot(fig2, filename = 'Monthly_dosimeter_status.html', auto_open=False)
Out[19]:
'Monthly_dosimeter_status.html'

Finally, to end the script we close the connection to our database.

In [20]:
conn.close()
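If you want the connection to be closed even when an error interrupts the script halfway, Python's contextlib offers a tidy pattern. Note the pitfall: with sqlite3.connect(...) as conn on its own only manages transactions (commit/rollback), it does not close the connection. A small sketch using a throwaway in-memory database:

```python
import sqlite3
from contextlib import closing

# closing() guarantees db.close() runs when the block exits,
# even if an exception is raised inside it
with closing(sqlite3.connect(":memory:")) as db:
    db.execute("CREATE TABLE demo (x INTEGER)")
    db.execute("INSERT INTO demo VALUES (1)")
    db.commit()
# db is closed here; using it again raises sqlite3.ProgrammingError
```

In an interactive notebook session the explicit conn.close() above is perfectly fine; the context manager shines in unattended scripts.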

We hope that you liked this tutorial. If you have any questions or suggestions on how to improve the script - let us know!

In closing we want to remind you to be mindful of regulations regarding data and code. It may take a vivid imagination to qualify a staff dose database as a medical device (phew, no need to consider MDR!), but in a real-world application we do handle personal (and thus sensitive) information so GDPR will most definitely apply. If you want to set up your own staff dose database you should contact the GDPR experts in your organization.

Jonas Andersson (jonas.s.andersson@umu.se)